Abstract

This paper proposes an architecture for deep neural networks with
hidden layer branches that learn targets of lower hierarchy than the final
layer targets. The branches provide a channel for enforcing useful
information in the hidden layers, which helps in attaining better accuracy
for both the final layer and the hidden layers. The shared layers modify
their weights using the gradients of all cost functions attached at or above
the branching layer. This model provides a flexible, modular inference system
with many levels of targets, which can be used efficiently in situations
requiring results at different levels of detail. This paper applies the idea
to a text classification task on the 20 Newsgroups data set with two levels
of hierarchical targets, and compares it with training without hidden layer
branches.


Author’s Note 

September, 2016 

This document was essentially a hasty write-up (May 2015) of a project for
a course on artificial neural networks during my undergraduate studies. I
am adding this note to point out mistakes that are detrimental to the
writing. I have kept the original content intact, adding only this box.

Firstly, the document doesn’t really use the terms from the title
(information, deep networks) well in the analysis. As for the idea itself,
there is a similar concept of auxiliary classifiers in the literature, which
uses [same] targets at lower levels to improve performance (see
arXiv:1409.4842v1 [cs.CV] for example). -1 to literature review. Furthermore,
the comparison is not rigorous enough to back up the claims and needs more
meaningful tests.
1 Introduction


Deep neural networks aim at learning multiple levels of features by using a
larger number of hidden layers than shallow networks. Using many layers,
higher order features can be learned automatically without any domain
specific feature engineering. This makes them more general inference
systems. They are effective at learning features from raw data that would
have required much effort to preprocess for shallow networks. For example,
a recent work (Zhang and LeCun 2015) demonstrated that deep temporal
convolutional networks can learn abstract text concepts from character level
inputs.

However, having multiple layers, deep networks are not easy to train. Some
of the problems are getting stuck in local optima and vanishing gradients.
If the hyperparameters of the network are not chosen properly, deep
networks also tend to overfit. The choice of activation functions (Glorot and
Bengio 2010) as well as proper initialization of weights (Sutskever et al.
2013) plays an important role in the performance of deep networks.

Several methods have been proposed to improve the performance of deep
networks. Layer by layer training of Deep Belief Networks (Hinton et al. 2006)
uses unsupervised pre-training of the component Restricted Boltzmann
Machines (RBMs), followed by supervised fine tuning of the whole network.
Similar models have been presented (Bengio and Lamblin 2007; Ranzato et al.
2007) that pre-train the network layer by layer and then fine tune it using
supervised techniques.

Unsupervised pre-training has been shown to work effectively as a
regularizer (Erhan et al. 2009; Erhan, Courville, and Vincent 2010) and to
increase performance compared to networks with randomly initialized weights.

This paper explores the idea of training a deep network to learn hierarchical
targets in which lower level targets are learned from taps in lower hidden
layers, while the highest level target (the most detailed one) is kept at the
final layer of the network. The hypothesis is that this architecture should
learn meaningful representations in the hidden layers too, because of the
branches. This can be helpful since the same model can be used as an
efficient inference system for any level of target, depending on the
requirement. Also, the meaningful information content of the hidden layer
activations can help in improving the overall performance of the network.

The following section presents the proposed deep network with hidden layer
branches. Section 3 provides the experimental results on the 20 Newsgroups
data set¹ along with the details of the network used in the experiment.
Section 4 contains the concluding remarks, and the scope of future work is
given in Section 5.

2 Proposed Network Architecture 

In the proposed network, apart from the final target layer, one (or more)
target layers are branched from the hidden layers. A simple structure with
one branching is shown in Figure 1. The target layers are arranged in a
hierarchical fashion, with the most detailed targets farthest from the input
and the trivial targets closer to the input layer. The network learns both
the final layer outputs and the hidden layer outputs. The following
subsection explains the learning algorithm using the example network in
Figure 1.

¹ The dataset can be downloaded from qwone.com/~jason/20Newsgroups/


Figure 1: Branched Network Structure (labelled nodes: input, final output)


2.1 Learning Algorithm 

The network learns using Stochastic Gradient Descent. There are two costs to
minimize, the first being that of the final target and the second that of the
hidden target. The network shown in Figure 1 has a branch from the layer
whose output is $x_B$. Weights and biases from $W_{B+1}, b_{B+1}$ to
$W_{N+2}, b_{N+2}$ are updated using the final target layer cost function
only, while $W_H$ and $b_H$ are updated using only the hidden layer cost
function:

$$W_i \leftarrow W_i - \eta \frac{\partial C}{\partial W_i} \qquad (1)$$

$$b_i \leftarrow b_i - \eta \frac{\partial C}{\partial b_i} \qquad (2)$$

Here, $\eta$ is the learning rate and $C$ is the hidden or final target cost
function, depending on which weights are being updated. For the weights that
are shared by both targets, i.e. weights and biases from $W_1, b_1$ to
$W_B, b_B$, the training uses both cost functions and a weighted-average
update is done for these parameters. If the final target cost is $C_F$ and
the hidden target cost is $C_H$, then the updates are:


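$$W_i \leftarrow W_i - \eta \left( \alpha \frac{\partial C_F}{\partial W_i} + (1 - \alpha) \frac{\partial C_H}{\partial W_i} \right) \qquad (3)$$

$$b_i \leftarrow b_i - \eta \left( \alpha \frac{\partial C_F}{\partial b_i} + (1 - \alpha) \frac{\partial C_H}{\partial b_i} \right) \qquad (4)$$

A value of $\alpha = 0.5$ gives equal weight to both gradients. This value is
used in the experiment in this paper.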
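The averaged update for the shared parameters can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `shared_update`, the learning rate argument `lr`, and the toy gradients are hypothetical, and the sketch assumes that the mixing weight multiplies the final-target gradient while its complement multiplies the hidden-target gradient.

```python
import numpy as np

def shared_update(param, grad_final, grad_hidden, lr=0.01, alpha=0.5):
    """One SGD step for a parameter shared by both targets.

    Assumption: alpha weights the final-target gradient and
    (1 - alpha) weights the hidden-target gradient.
    """
    mixed = alpha * grad_final + (1.0 - alpha) * grad_hidden
    return param - lr * mixed

# Toy shared weight matrix and two made-up gradients (illustrative only).
W = np.ones((2, 2))
g_final = np.full((2, 2), 2.0)
g_hidden = np.full((2, 2), 4.0)

# Mixed gradient is 0.5 * 2 + 0.5 * 4 = 3, so every entry moves by 0.1 * 3.
W_new = shared_update(W, g_final, g_hidden, lr=0.1, alpha=0.5)
```

With `alpha=1.0` the same function ignores the hidden-target gradient, and with `alpha=0.0` it ignores the final-target gradient.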


2.2 Features of the network 

• Performance: Representation of meaningful data in the hidden layers,
governed by the hidden layer branches, helps by providing features for
higher layers and thus improves the overall performance of the network.

• Hierarchical targets: Different target branches, arranged in a hierarchy
of detail, help in problems demanding scalability in the level of detail of
the targets.

• Modularity: The hidden layer targets lead to storage of meaningful content
in the hidden layers, and thus the network can be separated (recombined)
from (with) the branch joints without loss of the learned knowledge.

3 Experimental Results 

Hidden layer taps can be exploited only if the problem has multiple,
hierarchical targets. They can also work when it is possible to degrade the
resolution (or any other parameter related to detail) of the output to create
hidden layer outputs. This section explores the performance of the proposed
model on the 20 Newsgroups dataset.


3.1 Data set 

The data set has posts from 20 newsgroups, resulting in a 20 class
classification problem. According to the newsgroup topics, the 20 classes
were partitioned into 5 primitive classes (details are in Table 1). The final
layer of the network is made to learn the 20 class targets, while the hidden
layer branch is made to learn the cruder, 5 class targets. The dataset has
18846 instances. Of these, 14314 were selected for training, while the other
4532 were kept for testing.

Primitive class   Final class   Newsgroup topic
1                 1             comp.graphics
1                 2             comp.os.ms-windows.misc
1                 3             comp.sys.ibm.pc.hardware
1                 4             comp.sys.mac.hardware
1                 5             comp.windows.x
2                 6             rec.autos
2                 7             rec.motorcycles
2                 8             rec.sport.baseball
2                 9             rec.sport.hockey
3                 10            sci.crypt
3                 11            sci.electronics
3                 12            sci.med
3                 13            sci.space
4                 14            talk.politics.guns
4                 15            talk.politics.mideast
4                 16            talk.politics.misc
4                 17            talk.religion.misc
5                 18            alt.atheism
5                 19            misc.forsale
5                 20            soc.religion.christian

Table 1: Classes in data set. Primitive classes are used for training hidden layer
branches, while Final classes are used for training final layer
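The two-level labelling in Table 1 can be expressed as a small lookup, sketched below. The names `PRIMITIVE_RANGES` and `primitive_class` are hypothetical; only the class boundaries are taken from the table.

```python
# Final class index (1..20) -> primitive class index (1..5), read off Table 1.
PRIMITIVE_RANGES = {
    1: range(1, 6),    # comp.*
    2: range(6, 10),   # rec.*
    3: range(10, 14),  # sci.*
    4: range(14, 18),  # talk.*
    5: range(18, 21),  # alt.atheism, misc.forsale, soc.religion.christian
}

def primitive_class(final_class):
    """Return the primitive (hidden target) class for a final class index."""
    for primitive, members in PRIMITIVE_RANGES.items():
        if final_class in members:
            return primitive
    raise ValueError(f"final class must be in 1..20, got {final_class}")

# Hidden-target labels for every final class, in order.
hidden_labels = [primitive_class(c) for c in range(1, 21)]
```

Each training instance then carries a pair of labels: the final class for the output layer and the derived primitive class for the hidden layer branch.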


3.2 Word2Vec preprocessing 


For representing text, a simple and popular model can be made using Bag of
Words (BoW). In this model, a vocabulary of words is built from the corpus,
and each paragraph (or instance) is represented by a histogram of the
frequency of occurrence of words from the vocabulary. Although intuitive and
simple, this representation has a major disadvantage when working with
neural networks. The vocabulary is usually very large, of the order of tens
of thousands of words, while each chunk of text under consideration contains
only a few of the possible words, which results in a very sparse
representation. Such a sparse input representation can lead to poor learning
and high inefficiency in neural networks. A newer tool, Word2Vec², is used
here to represent words as dense vectors.
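The sparsity issue can be seen even in a toy Bag of Words example. The sketch below is illustrative only: the six-word vocabulary, the sample post, and the helper name `bow_vector` are made up and are not taken from the experiment.

```python
from collections import Counter

def bow_vector(tokens, vocabulary):
    """Histogram of vocabulary word counts for one tokenized text."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical six-word vocabulary; real vocabularies are far larger.
vocabulary = ["car", "engine", "goal", "hockey", "orbit", "space"]
post = "the hockey team scored a goal and another goal".split()

vec = bow_vector(post, vocabulary)
# Fraction of vocabulary entries that are zero for this post.
sparsity = vec.count(0) / len(vec)
```

Here four of the six entries are zero; with a vocabulary of tens of thousands of words, almost all entries of a typical post's vector are zero.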

² Python adaptation (Řehůřek 2013) here: https://radimrehurek.com/gensim/models/word2vec.html

Word2Vec is a tool for computing continuous distributed representations of
words. It uses the Continuous Bag of Words and Skip-gram methods to learn
vector representations of words from a corpus (Mikolov et al. 2013b; Mikolov
et al. 2013a). The representations provided by Word2Vec group similar words
closer in latent space. These vectors have properties like (Mikolov, Yih,
and Zweig 2013):

$$v(\text{'king'}) - v(\text{'man'}) + v(\text{'woman'}) \approx v(\text{'queen'})$$

Here, $v(\text{'word'})$ represents the vector of “word”. For the problem at
hand, a Word2Vec model with 1000 dimensional vector output was trained using
the entire dataset (after removing English language stop words). To make a
vector representing each newsgroup post, the vectors of all the words in the
post were averaged.
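A minimal sketch of the averaging step is given below. The helper `post_vector`, the three dimensional toy vectors, and the tiny stop word set are hypothetical stand-ins for the trained 1000 dimensional model. Skipping unknown words and returning a zero vector for an empty post are assumptions, since the text does not say how such cases were handled.

```python
import numpy as np

def post_vector(tokens, word_vectors, stop_words, dim):
    """Average the vectors of all known, non-stop words in a post."""
    vecs = [word_vectors[t] for t in tokens
            if t not in stop_words and t in word_vectors]
    if not vecs:
        # Assumption: fall back to a zero vector when no word is usable.
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy 3-dimensional vectors standing in for the 1000-dimensional model.
word_vectors = {
    "space": np.array([1.0, 0.0, 2.0]),
    "orbit": np.array([3.0, 2.0, 0.0]),
}
stop_words = {"the", "in"}

v = post_vector("the orbit in space".split(), word_vectors, stop_words, dim=3)
```

The result is a fixed-length dense vector regardless of the length of the post, which is what allows it to be fed directly to the input layer.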

3.3 Network Architecture 

The network used had 4 hidden layers. The number of neurons in the layers
was:

1000 (input) → 300 → 200 → 200 → 130 → 20 (target)
               300 → 5 (hidden target)

From hidden layer 1 (with 300 neurons), a branch was created to learn the
hidden target. The weights and biases are:

$W_N, b_N$ for connections from layer $N - 1$ to layer $N$.

$W_H, b_H$ for connections from the hidden layer tap to the hidden target.

Rectified Linear Units (ReLUs) were chosen as the activation functions of the
neurons since they are less prone to vanishing gradients (Nair and Hinton
2010). The ReLU activation function is given by:

$$f(x) = \max(x, 0) \qquad (5)$$


The output layers (both final and hidden branch) used softmax logistic
regression, with the multinomial log loss as the cost function. For the
hidden output cost function, L2 regularization was also added for the weights
of hidden layer 1. Training was performed with simple stochastic gradient
descent using the algorithm explained in Section 2.1, with a mini batch size
of 256 and a momentum value of 0.9. Since the aim is comparison, no attempts
were made to achieve state-of-the-art accuracies.
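The forward pass of this branched network can be sketched in plain NumPy as follows. This is not the Kayak implementation used in the experiment: the initialization scale, the random seed, the random inputs and labels, and all function names are assumptions, and the L2 term on the hidden layer 1 weights is left out for brevity. Only the layer sizes, the ReLU and softmax choices, and the batch size of 256 come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
SIZES = [1000, 300, 200, 200, 130, 20]  # input, four hidden layers, final target
N_HIDDEN_TARGET = 5                     # primitive classes for the branch

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_loss(probs, labels):
    """Mean multinomial log loss for integer class labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

# Assumed initialization: small Gaussian weights, zero biases.
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(SIZES[:-1], SIZES[1:])]
biases = [np.zeros(n) for n in SIZES[1:]]
W_H = rng.normal(0, 0.01, (SIZES[1], N_HIDDEN_TARGET))
b_H = np.zeros(N_HIDDEN_TARGET)

def forward(x):
    h = relu(x @ weights[0] + biases[0])   # hidden layer 1 (300 units)
    p_hidden = softmax(h @ W_H + b_H)      # branch to the 5-class hidden target
    for W, b in zip(weights[1:-1], biases[1:-1]):
        h = relu(h @ W + b)                # hidden layers 2 to 4
    p_final = softmax(h @ weights[-1] + biases[-1])
    return p_final, p_hidden

# One mini batch of 256 random inputs with random labels (illustrative only).
x = rng.normal(size=(256, 1000))
y_final = rng.integers(0, 20, size=256)
y_hidden = rng.integers(0, 5, size=256)

p_final, p_hidden = forward(x)
loss_final = log_loss(p_final, y_final)
loss_hidden = log_loss(p_hidden, y_hidden)
```

The two losses correspond to the final and hidden target costs. Their gradients with respect to the first layer's parameters are the ones that would be mixed in the shared update of Section 2.1.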

The network was implemented using the Python library for deep neural
networks, Kayak³.

³ Harvard Intelligent Probabilistic Systems (HIPS), https://github.com/HIPS/Kayak

3.4 Performance

Three training experiments were performed, as elaborated below: 

1. With simultaneous updates for the shared layers (100 epochs) + fine tuning
(20 epochs)

2. Without simultaneous updates for the shared layers, by ignoring gradients
coming from the hidden layer target (100 epochs) + fine tuning (20 epochs)

3. Training using only the hidden layer target (100 epochs) + fine tuning (20
epochs)

The fine tuning step only updates the hidden tap to hidden target weights
and biases, $W_H, b_H$. This was performed to see the state of the losses of
the network with respect to the hidden layer targets. All three training
experiments were performed with the same set of hyperparameters and were
repeated 20 times to account for random variations. Mean training losses
over all 20 repetitions were plotted throughout the course of training.

The plot of training losses for the final layer target in experiments 1 and 2
is shown in Figure 2. From the plot, simultaneous training seemingly performs
better than direct training involving only the final target cost function
minimization.



Figure 2: Mean final target losses during training. Error bars represent one
standard deviation.

The plot of training losses for the hidden layer target in all three
experiments is given in Figure 3. Here, training with only minimization of
the final cost is not able to generate a representation of the data effective
enough to help in minimizing the hidden cost function, while simultaneous
training and training involving only hidden cost minimization give almost
similar performance. The situation is clearer in Figure 4, which plots the
losses for the hidden target during the fine tuning process for all three
experiments. As this graph shows, training with only the final target cost in
consideration is not able to minimize the loss as well as the other two
methods. Also, the curve of simultaneous training starts with a lower loss
than the curve of training with the hidden cost only. This suggests better
updates of weights in simultaneous training as compared to training with only
the hidden cost.



Figure 3: Mean hidden target losses during training. Error bars represent one
standard deviation.

Figures 5 and 6 show box plots of the accuracies over the 20 repeated
experiments for the final and hidden targets, respectively.

Table 2 shows the mean classification accuracy on the final and hidden
targets for both the training and testing sets. As the table and box plots
show, simultaneous training gives the highest mean test accuracy on both
targets (83.05% hidden, 69.21% final), although its margin over the next best
method is smaller than one standard deviation in each case.


4 Conclusion 

This paper presented a branching architecture for neural networks that, when
applied to an appropriate problem with multiple levels of outputs, inherently
causes the hidden layers to store meaningful representations and helps in
improving performance. The training curves showed that during simultaneous
training, the shared layers learned a representation that minimized both cost
functions and also had better weights for the hidden targets.

Figure 4: Mean hidden target losses during fine tuning. Error bars represent one
standard deviation.

Figure 5: Boxplots for final target accuracies

Figure 6: Boxplots for hidden target accuracies


The branches help in enforcing information in the hidden layers, and thus the
auxiliary branches can be added to or removed from the network easily. This
provides flexibility in terms of modularity and scalability of the network.

Hidden Target Accuracy
                        Train (%)               Test (%)
Hidden Training         85.320316 (1.523254)    82.822154 (0.417403)
Final Training          70.205044 (4.088195)    77.763680 (1.602464)
Simultaneous Training   84.051977 (1.182006)    83.052736 (0.356259)

Final Target Accuracy
                        Train (%)               Test (%)
Hidden Training         4.998253 (1.453776)     5.015446 (1.446461)
Final Training          74.332472 (2.295639)    69.088703 (1.325522)
Simultaneous Training   76.824787 (1.792208)    69.205649 (1.183573)

Table 2: Mean accuracies for the experiments. The values in parentheses are
standard deviations.
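As a rough reading aid for Table 2, the sketch below expresses the test accuracy gap between simultaneous training and the best alternative in units of the larger standard deviation. The helper `gap_in_stds` is hypothetical, and this crude ratio is not a significance test; it only uses the means and standard deviations copied from the table.

```python
# (mean, std) test accuracies in percent, copied from Table 2.
hidden_test = {
    "hidden": (82.822154, 0.417403),
    "final": (77.763680, 1.602464),
    "simultaneous": (83.052736, 0.356259),
}
final_test = {
    "final": (69.088703, 1.325522),
    "simultaneous": (69.205649, 1.183573),
}

def gap_in_stds(a, b):
    """Difference of means divided by the larger of the two standard deviations."""
    return (a[0] - b[0]) / max(a[1], b[1])

# Simultaneous training versus the best single-cost alternative on each target.
g_hidden = gap_in_stds(hidden_test["simultaneous"], hidden_test["hidden"])
g_final = gap_in_stds(final_test["simultaneous"], final_test["final"])
```

Both ratios come out positive but below one, so the mean improvements are smaller than the spread across the 20 repetitions.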

5 Future Work 

The key concept in the proposed architecture is to exploit the hidden layers
through meaningful representations. Using a hierarchy of targets, the
proposed architecture can form meaningful hidden representations.

An extended experiment can be done with many branches. Convolutional
networks working on computer vision problems are ideal candidates for these
tests, as it is easy to visualize the weights to find connections with the
desired representations. Also, vision problems can be broken into many
levels of detail, and thus a hierarchy of outputs can be generated from a
single output layer.

Whereas this paper focused on a problem involving branches from the hidden
layers, an exploration can be done in which a few hidden neurons directly
represent the hidden targets without any branching. Further, work can be done
on constructing multiple levels of outputs from a single output. This can be
useful for computer vision problems, where different levels of outputs can be
practically useful.